FOREWORD


The Army faces a continuing demand to meet recruiting quality goals.
Recent advances in computer technology and psychometric theory have made
possible a new type of assessment technique, called computerized adaptive
testing (CAT), that can provide accurate ability estimates based on relatively
few test items. The Computerized Adaptive Screening Test (CAST) was designed
to provide an estimate of a prospect's Armed Forces Qualification Test (AFQT)
score at the recruiting station. Recruiters use CAST to help determine
whether to send prospects to Military Entrance Processing Stations for further
testing and to forecast the various options and benefits for which the
prospects will subsequently qualify. This report summarizes analyses from a
nationwide cross-validation of CAST.

This research was conducted under the Manpower and Personnel Research
program and contributes to the mission of the Selection and Classification
Technical Area to improve the Army's capability to select and classify its
applicants using state-of-the-art and fair measures to assess applicant
potential. Continuing research and development of CAST is conducted under
the sponsorship of the U.S. Army Recruiting Command (USAREC) as outlined in
a Memorandum of Understanding dated 29 August 1984 regarding the Army Research
Institute/USAREC Research and Development Program. The information in this
report was briefed to the Director of Recruiting Operations Directorate,
USAREC, on 3 September 1987. The results are being used to further document
the acceptability of using CAST as a prescreening tool and to direct future
refinement efforts.
FINAL REPORT ON A NATIONAL CROSS-VALIDATION OF THE COMPUTERIZED ADAPTIVE
SCREENING TEST (CAST)


EXECUTIVE  SUMMARY 


Requirement: 

To  evaluate  the  performance  of  the  Computerized  Adaptive  Screening  Test 
(CAST)  using  data  from  a  nationally  representative  sample  of  prospective 
applicants  (prospects). 

Procedure:

A modified version of the CAST software was used in 60 recruiting stations
across the country from January through December 1985 so that prospects' CAST
performance could be recorded on data diskettes for analysis. CAST performance
information was collected from 14,410 examinees. These data were
matched  to  applicant  records  from  the  Military  Entrance  Processing  Stations 
to  obtain  Armed  Services  Vocational  Aptitude  Battery  (ASVAB)  scores  and 
relevant  demographic  data  for  those  prospects  who  went  on  for  further 
testing.  Validity  data  were  examined  using  regression  and  cross-tabulation 
analyses.  In  addition,  the  item  characteristics  of  the  available  Arithmetic 
Reasoning  (AR)  and  Word  Knowledge  (WK)  item  banks  were  compared  to  those  of 
the  subset  of  items  that  were  actually  administered  to  the  CAST  examinees. 

Findings:

The correlation between CAST and Armed Forces Qualification Test (AFQT)
scores (corrected 1980 Youth Norms) is .79 (N = 5,909). When corrected for
restriction in range, the correlation is .83. Uncorrected correlations
between CAST and Aptitude Area scores range from .64 to .82. At each of the
IIIA/IIIB and IIIB/IVA AFQT cutpoints, CAST correctly predicted for 81% of the
examinees whether they would score above or below the cutpoint. Most of the
WK items available for use were administered at least 15 times during this
data collection (63 out of 78). Only 54 of the 225 AR items were administered
at least 15 times in this sample of examinees. The item characteristics of
the WK item pool are more desirable than the characteristics of the AR item
pool; however, both pools meet minimum psychometric standards. Alternative
subtest lengths were evaluated using multiple correlation and administration
time estimates. There is no compelling evidence for altering the current
subtest length at this time.

Utilization  of  Findings: 

This  report  will  be  used  by  the  U.S.  Army  Recruiting  Command  to  justify 
continued  use  of  CAST  as  an  enlistment  screening  test. 


INTRODUCTION 

The Computerized Adaptive Screening Test (CAST) was designed by the Navy
Personnel Research and Development Center (NPRDC) with funding from the Army
Research Institute (ARI) to provide a prediction of prospective recruits'
Armed Forces Qualification Test (AFQT) scores at recruiting stations. The
purpose of this report is to summarize information about CAST that has been
obtained through a large-scale data collection effort conducted in 1985. The
report begins with a brief review of CAST's history and concludes with a
review of planned modifications of the test.

History 

Applicants for the U.S. armed services are required to take the Armed
Services Vocational Aptitude Battery (ASVAB). The paper-and-pencil ASVAB is
composed of ten subtests and requires approximately three and one-half hours
to administer. The subtest scores are combined to create a variety of
composite scores that are used for the selection and classification of
enlisted personnel. AFQT scores are currently computed by summing the word
knowledge (WK), arithmetic reasoning (AR), paragraph comprehension (PC), and
numerical operations (NO) subtest scores as follows: WK + AR + PC + 1/2 NO.
The AFQT score is intended to be a measure of an individual's trainability and
is used to assess eligibility for enlistment and special benefits. Each
service uses unique aptitude area composites to determine eligibility for
specific military occupations.
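
The AFQT composite described above is simple arithmetic, and a short sketch
may make the half weight on NO concrete. The function name and the example
subtest scores are hypothetical, and the later conversion of the raw composite
to a percentile score is not shown.

```python
def afqt_raw_composite(wk, ar, pc, no):
    """Raw AFQT composite as stated in the text: WK + AR + PC + 1/2 NO."""
    return wk + ar + pc + 0.5 * no


# Hypothetical subtest scores, for illustration only.
example = afqt_raw_composite(wk=28, ar=20, pc=12, no=40)
```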

To facilitate the recruiting process, recruiters require information
regarding a prospect's probable performance on AFQT. In the late 1970s, the
Enlistment Screening Test (EST) was made available to recruiters in all of the
armed services (Mathews & Ree, 1982). EST is a traditional paper-and-pencil
test that contains 48 items which are similar in content to items in the
ASVAB's WK, AR, and PC subtests. In 1984, CAST was made available to Army
recruiters. Advantages of CAST over EST include less test administration time
and reduced administrative burden for the recruiter. More detailed
discussions of why recruiters use screening tests, how they use the tests, and
the differences between EST and CAST are presented in earlier CAST reports
(Baker, Rafacz, & Sands, 1984; Knapp & Pliske, 1986; Knapp, Pliske, & Elig,
1987; Pliske, Gade, & Johnson, 1984; Sands & Gade, 1983).

Description 

CAST is composed of two subtests: WK and AR. It is an adaptive test
based on item response theory (Lord, 1980). Thus, the test items administered
to a given examinee are selected on the basis of that examinee's estimated
ability level (known as theta). There are 78 items in the WK item bank and
225 items in the AR item bank. All CAST items are multiple choice with a
maximum of five response alternatives. CAST uses the Bayesian sequential
scoring procedure discussed by Jensema (1977) to score and select subsequent
items for administration. The item selection procedure incorporates an
element of randomization that was intended to reduce item exposure.
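
The general idea of adaptive testing with randomized item selection can be
illustrated with a minimal sketch. It is not the CAST algorithm: CAST uses the
Bayesian sequential procedure discussed by Jensema (1977), whereas this sketch
substitutes a simpler grid-based posterior over theta and picks at random among
the few most informative unused items. The item parameters, the scaling
constant D = 1.7, the standard normal prior, and all function names are
assumptions made for illustration.

```python
import math
import random

D = 1.7  # conventional logistic scaling constant (assumed, not from the report)
GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4.0 to 4.0


def p_correct(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Fisher information of a three-parameter logistic item at theta."""
    p = p_correct(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def normal_prior():
    """Standard normal prior over the theta grid, normalized to sum to 1."""
    weights = [math.exp(-0.5 * t * t) for t in GRID]
    total = sum(weights)
    return [w / total for w in weights]


def update(posterior, item, correct):
    """Bayes update of the grid posterior after one scored response."""
    a, b, c = item
    new = []
    for t, prob in zip(GRID, posterior):
        p = p_correct(t, a, b, c)
        new.append(prob * (p if correct else 1.0 - p))
    total = sum(new)
    return [w / total for w in new]


def posterior_mean(posterior):
    return sum(t * p for t, p in zip(GRID, posterior))


def select_item(theta, bank, used, rng, top_k=3):
    """Choose at random among the top_k most informative unused items."""
    candidates = [i for i in range(len(bank)) if i not in used]
    candidates.sort(key=lambda i: item_information(theta, *bank[i]), reverse=True)
    return rng.choice(candidates[:top_k])


def run_cat(bank, answer_fn, n_items, seed=0):
    """Administer n_items adaptively; answer_fn(i) returns True if item i is correct."""
    rng = random.Random(seed)
    posterior, used = normal_prior(), set()
    for _ in range(n_items):
        i = select_item(posterior_mean(posterior), bank, used, rng)
        used.add(i)
        posterior = update(posterior, bank[i], answer_fn(i))
    return posterior_mean(posterior), used
```

Choosing among several near-optimal items, rather than always the single best
one, is one simple way to spread exposure across the bank at a small cost in
measurement efficiency.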


CAST  is  currently  administered  on  the  Joint  Optical  Information  Network 
(JOIN)  microcomputer  system.  JOIN  was  designed  for  the  U.S.  Army  Recruiting 
Command  (USAREC)  to  serve  a  number  of  functions  at  recruiting  stations  and 
Military  Entrance  Processing  Stations.  The  system  has  47K  of  memory  available 
for  applications  programming. 

Development 

The item pools for CAST were developed and calibrated by researchers at
the University of Minnesota (cf. Moreno, Wetzel, McBride, & Weiss, 1984) for
an experimental version of a computerized adaptive ASVAB (CAT ASVAB). The
items were drawn from four separate calibration efforts. One-half of the 78
WK items were calibrated on a sample of 677 Marine recruits who took the items
via computer. The remaining WK items were calibrated using a sample of
approximately 1,300 Marine recruits who took the items using paper and
pencil. One hundred and forty-eight of the AR items were calibrated on a
sample of Air Force recruits ranging in number from 819 to 1,040 examinees per
item. These items were computer-administered. The remaining 77 AR items were
calibrated on a sample of 4,100 Navy and Marine recruits using a
paper-and-pencil item administration. All CAST items were calibrated using a
three-parameter logistic ogive item response model (Birnbaum, 1968).
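
For reference, the three-parameter logistic model is conventionally written as
shown below. The notation and the scaling constant D = 1.7 are the usual
textbook ones and are assumed here; they are not taken from the calibration
reports.

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + \exp\left[-D\,a_i(\theta - b_i)\right]},
\qquad D = 1.7
```

Here a_i is the discrimination of item i, b_i its difficulty, and c_i the
lower asymptote, that is, the probability that a very low-ability examinee
answers correctly.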

Moreno et al. (1984) provided a de facto pilot test of CAST in their
research, which examined the relationship between corresponding ASVAB and CAT
ASVAB subtests. These researchers administered CAT versions of the WK, AR,
and PC subtests to 270 male Marine recruits at the Marine Corps Recruit Depot
in San Diego, CA. The WK and AR subtest item banks were the same as those
described above. The data from this pilot test yielded a correlation of .87
between the three optimally weighted CAT ASVAB subtests and ASVAB AFQT.

Because the Moreno et al. (1984) data indicated that the PC subtest did
not contribute a significant amount of predictive power beyond that provided
by the WK and AR subtests, and because the PC subtest items required an
inordinate amount of time to administer, this subtest was not incorporated
into CAST. Note that an NO subtest was not considered because it is a speeded
test that does not lend itself to an adaptive testing format and because it
would require precise time limits. Thus, only WK and AR items were
administered to the Army applicants who participated in CAST's field test at
the Los Angeles Military Entrance Processing Station (Sands & Gade, 1983).
Specifically, 20 WK and 15 AR items were administered adaptively to 312
examinees on an Apple II microcomputer. Multiple correlation coefficients
were computed for each of the 300 possible combinations of subtest lengths.
Examination of these results, in light of judgments regarding the probable
administration time of the various subtest lengths, led to the recommendation
that the operational CAST be terminated following the administration of 10 WK
and 5 AR items. The multiple correlation between this optimally weighted
subtest score combination and actual AFQT scores was .85.
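
The 300 combinations arise from crossing 20 possible WK lengths with 15
possible AR lengths. A minimal sketch of that kind of analysis follows. It
assumes, hypothetically, that theta estimates were saved after each
administered item, and it uses ordinary least squares to obtain each multiple
correlation; the array layout and function names are illustrative and are not
taken from Sands and Gade (1983).

```python
import numpy as np


def multiple_r(X, y):
    """Multiple correlation of y with the columns of X (least squares, with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.corrcoef(A @ coef, y)[0, 1])


def length_grid(wk_theta, ar_theta, afqt):
    """Multiple R for every (WK length, AR length) pair.

    wk_theta[:, j] holds each examinee's WK theta estimate after j + 1 items;
    ar_theta is laid out the same way for AR.
    """
    n_wk, n_ar = wk_theta.shape[1], ar_theta.shape[1]
    R = np.zeros((n_wk, n_ar))
    for j in range(n_wk):
        for k in range(n_ar):
            X = np.column_stack([wk_theta[:, j], ar_theta[:, k]])
            R[j, k] = multiple_r(X, afqt)
    return R
```

With 20 WK columns and 15 AR columns, the returned array has 300 cells, one
per combination of subtest lengths, which can then be weighed against the
expected administration time for each combination.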
Early Cross-Validation Evidence

Army recruiting stations in the midwestern region of the U.S. provided
CAST cross-validation data during January and February of 1984 (Pliske et al.,
1984). At that time, CAST was fully operational in only this region
of the country. The CAST scores provided by participating recruiting stations
were matched to ASVAB records available from the Military Entrance Processing
Command (MEPCOM). CAST and ASVAB data were available for 1,962 individuals.
The bivariate correlation between these CAST and AFQT scores was .80.

Purpose  of  Present  Investigation 

This project had several goals. The first goal was to provide a
comprehensive evaluation of the prediction equation that had originally been
incorporated into CAST. Recall that the multiple regression equation combines
the final WK and AR theta estimates to produce a predicted AFQT percentile
score. The second goal was to compute a new prediction equation and evaluate
its operation. A third goal was to describe the operational nature of CAST in
terms of administration time and item pool usage. To date, the descriptions
of the two CAST item pools have covered all test items. It has been evident,
however, that CAST actually administers only a subset of those items. Thus, a
more accurate description of the test would focus on the "operational" subset
of items. Finally, this project provides the data required to evaluate CAST's
utility for predicting performance on the Army's aptitude area composites.
When CAST was introduced, the possibility that it might be useful for
predicting eligibility for training assignments was raised; however, the
relevant data were not available at that time (Sands & Gade, 1983).

Preliminary  results  that  were  based  on  analysis  of  data  collected  during 
the  first  six  months  of  this  project  have  been  documented  in  two  reports 
(Knapp  &  Pliske,  1986;  Knapp  et  al.,  1987). 


METHOD 


Subjects 

CAST performance information was obtained from 14,410 Army prospects.
Correct AFQT percentile scores could be obtained for only 41% (n = 5,909) of
this sample. The primary reason for failure to obtain AFQT scores for
everyone is that many of the CAST examinees never went on to take ASVAB. We
have only limited information by which we can evaluate the extent to which
this validation sample represents the population of Army prospects. Since
CAST is a screening test, the most obvious concern is limited variation in the
CAST performance of individuals for whom AFQT scores are available. The mean
CAST score (i.e., predicted AFQT percentile score) for the larger unrestricted
sample is 39 (SD = 20.6), whereas the mean CAST score in the validation sample
is 45 (SD = 17.9), indicating that such concern is justified. Fortunately, the
availability of a good estimate of the population standard deviation (i.e.,
the SD of the unrestricted sample) permits correction of validity estimates
for restriction in range.
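
The report does not state which correction formula was used. A minimal
sketch, assuming the standard correction for direct restriction on the
predictor (often called Thorndike's Case 2), is shown below. Applied to the
uncorrected CAST-AFQT correlation of .79 reported in this document and the two
standard deviations given above, it rounds to the reported corrected value of
.83. The function name is hypothetical.

```python
import math


def correct_for_range_restriction(r, sd_restricted, sd_unrestricted):
    """Correct a correlation for direct range restriction on the predictor."""
    k = sd_unrestricted / sd_restricted
    return r * k / math.sqrt(1.0 - r * r + r * r * k * k)


# Uncorrected r = .79; SD = 17.9 (validation sample), SD = 20.6 (unrestricted).
corrected = correct_for_range_restriction(0.79, 17.9, 20.6)
print(round(corrected, 2))  # 0.83
```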
Of additional concern regarding the validation sample is the extent to
which it represents the population of Army prospects with respect to
demographic characteristics. Demographic data are available only for the
validation sample, so the adequacy of the sample must be inferred on the basis
of the sample selection procedure and a priori expectations. The
characteristics of the validation sample are summarized in Table 1. This
information portrays a reasonable picture of the prospect population. It
should be noted that the 60 recruiting stations that participated in the data
collection effort were selected to be representative of all Army recruiting
stations in terms of geographical location and population density. The
sampled stations were also selected to ensure that a relatively large number
of black prospects would be included. Indeed, the large percentage of black
prospects in the validation sample is the only aspect of the sample which
appears at odds with expectations regarding the relevant population.


Table 1

National CAST Cross-Validation (January - December 1985) Sample Description

Sample Size       5,909

Sex               82% Male
                  18% Female

Race              58% White
                  38% Black
                   4% Other

Age               Mean = 20; SD = 3.59
                  Median = 19
                  Mode = 18

Component         86% Regular Army
                  14% Army Reserve

AFQT Category     24% I and II
(From ASVAB)      17% IIIA
                  30% IIIB
                  29% IVA and V

Procedure 


Currently, the JOIN system is programmed to record each examinee's name
and CAST score onto a "Prospect Data" diskette that the recruiter keeps for
his or her own use. A modified version of the CAST software was designed to
collect more detailed information onto special data collection diskettes that
were sent to ARI for analysis. Information recorded on the diskettes included
the identification number of each test item administered to the examinee, the
examinee's answer to each item, the time it took for the examinee to read and
answer each item, and the examinee's social security number (SSN). The
software was also changed so that the prospects would respond to five more
items per subtest than are actually used to compute the operational test
score.

At the end of each month during the 12-month data collection period,
personnel at each of the 60 participating recruiting stations forwarded the
data collection diskettes to ARI. The information on these diskettes was
uploaded to a mainframe computer system where it was put into a format that
permitted it to be matched to MEPCOM records. MEPCOM records were also
provided to ARI on a monthly basis. These records contained not only the
subsequent ASVAB (AFQT and other composite) scores but also demographic
information for each examinee.

Analyses 

The large amount of validation data available from this effort permitted
the cross-validation of CAST's original prediction equation and the
development and cross-validation of new prediction equations. A new
prediction equation was incorporated into the CAST software in 1986. The
performance of this revised algorithm is evaluated in terms of its ability to
make linear point and category predictions.

In 1986, MEPCOM revised the tables that are used to convert raw AFQT
scores to percentile scores. Accordingly, all AFQT scores for the examinees
in this investigation were converted to percentile scores using the revised
conversion tables. This procedure resulted in the loss of a small number of
cases due to insufficient information required to perform the score
conversion. The revised AFQT conversion tables also affected the performance
of the CAST prediction equation. The impact of this change will be described.

In addition to evaluating CAST's ability to predict AFQT performance,
CAST's relationship to Army aptitude area scores will be described. These
analyses are intended to provide Army policy-makers with information that
would help them evaluate additional uses for CAST.

Finally, the operational nature of the test will be more fully described.
Before this data collection effort began, there was very little information
regarding administration time and item pool utilization. Analyses reported
herein compare the percentage of items available with the percentage of items
actually used in operational testing, and describe the psychometric
characteristics of these items.
RESULTS


Original  Prediction  Equation 

The correlation between CAST scores derived using the original prediction
equation and revised AFQT percentile scores (1980 Youth Norms) is .79. When
corrected for restriction in the range of CAST scores, the validity estimate
is .83. The uncorrected estimate is somewhat lower than the estimate provided
in Knapp and Pliske (1986), which was .82 (uncorrected). The drop in validity
does not appear to be caused by statistical artifacts (e.g., differences in
score variance), and it is too small to be of concern.

Developing  a  New  Prediction  Equation 

To develop a new prediction equation, the data base (n = 5,929) was divided
into a development sample and a cross-validation sample. Seventy percent
(n = 4,166) of the examinees were included in the development sample and the
remaining examinees (n = 1,763) comprised the cross-validation sample.
Examinees were selected for each sample on the basis of the last digit of
their SSNs.


AFQT scores were regressed on final WK and AR theta values in the
development sample. The multiple correlation was .79. The resulting subtest
weights were used to compute CAST scores for the cross-validation sample. The
bivariate correlation between these CAST scores and AFQT percentile scores was
.80. The lack of shrinkage is likely due to the fact that the equation was
developed using a sample large enough to provide stable estimates of the
regression weights and intercept. This revised regression equation was
incorporated into the operational CAST in late 1986.
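
The development and cross-validation steps described above can be sketched
as follows. This is a minimal illustration under stated assumptions, not the
analysis code used for the report: the report does not say which last digits
defined the development sample, so the set of seven digits below is a
hypothetical choice that yields roughly 70%, and all function names are
illustrative.

```python
import numpy as np

# Hypothetical choice: 7 of the 10 possible last digits, about 70% of examinees.
DEV_DIGITS = [0, 1, 2, 3, 4, 5, 6]


def split_by_ssn(ssn_last_digit):
    """Boolean mask that is True for development-sample examinees."""
    return np.isin(ssn_last_digit, DEV_DIGITS)


def fit_equation(wk_theta, ar_theta, afqt):
    """Least-squares fit of AFQT on WK and AR theta: [intercept, WK weight, AR weight]."""
    A = np.column_stack([np.ones(len(afqt)), wk_theta, ar_theta])
    coef, *_ = np.linalg.lstsq(A, afqt, rcond=None)
    return coef


def predict(coef, wk_theta, ar_theta):
    """Predicted AFQT score (the CAST score) from the two theta estimates."""
    return coef[0] + coef[1] * wk_theta + coef[2] * ar_theta


def cross_validate(wk_theta, ar_theta, afqt, ssn_last_digit):
    """Fit on the development sample, then correlate predictions in both samples."""
    dev = split_by_ssn(ssn_last_digit)
    coef = fit_equation(wk_theta[dev], ar_theta[dev], afqt[dev])
    r_dev = np.corrcoef(predict(coef, wk_theta[dev], ar_theta[dev]), afqt[dev])[0, 1]
    r_cv = np.corrcoef(predict(coef, wk_theta[~dev], ar_theta[~dev]), afqt[~dev])[0, 1]
    return coef, float(r_dev), float(r_cv)
```

Comparing the development-sample correlation with the cross-validation
correlation is what reveals shrinkage; when the two are nearly equal, as
reported here, the weights can be regarded as stable.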

Once corrected AFQT 1980 Youth Norms became available, the procedures
described above were used to develop a third regression equation. Although
the resulting subtest weights, the multiple correlation, and the standard
error of estimate were the same (within rounding error), the intercept was
almost 2 points higher. Thus, the prediction equation currently incorporated
into CAST yields AFQT score predictions that tend to be a couple of points too
low across most of the score range.

Validity  of  CAST's  Point  Predictions 

Table 2 shows the uncorrected and corrected validity estimates for CAST.
These estimates were derived for the entire sample and for selected subgroups
of the sample. Figure 1 depicts the regression of AFQT scores onto CAST
scores for the total sample. This figure illustrates CAST's tendency to
underpredict performance on AFQT. The standard error of estimate associated
with this regression is 14 points.

The values in Table 2 show that differences in validity across racial and
gender subgroups are slight. Statistical tests for subgroup differences in
regression lines confirmed the results reported in Knapp et al. (1987). That
is, the AFQT performance of black examinees tends to be overpredicted
relative to white examinees, and the AFQT performance of male examinees tends
to be overpredicted relative to female examinees. These differences are
minimal and parallel those found with other standardized cognitive ability
tests (e.g., Dunbar & Novick, 1985; Hanser & Grafton, 1982; Kallingal, 1971).


Table 2

Bivariate Correlation Between CAST and AFQT Scores by Race and Sex

Group                      n        r      rc^a

All                      5,909     .79     .83
White, Non-Hispanic      3,424     .78     .83
Black                    2,244     .69     .80
Hispanic                   241     .80     .86
Male                     4,835     .80     .84
Female                   1,074     .77     .82

^a Correlations corrected for restriction of range in CAST scores.

Validity  of  CAST's  Category  Predictions 

With currently available data, it is impossible to provide an accurate
portrayal of how successful CAST has been with respect to predicting AFQT
category classifications. On the basis of an examinee's CAST score, the
recruiter predicts the AFQT category to which the examinee is likely to
belong. In one of its Army regulations, USAREC has provided recruiters with a
table that can be used to convert CAST scores to probability estimates related
to subsequent classification into four AFQT categories (see Pliske et al.,
1984, for a discussion of the development of this table). The extent to which
recruiters use this conversion table is unknown. Some recruiters may simply
interpret CAST scores at face value. For example, if an examinee's CAST score
is 49, the recruiter predicts AFQT category IIIB, whereas if the CAST score is
50, the recruiter predicts AFQT category IIIA. Other recruiters, having noted
CAST's tendency to underpredict AFQT performance, might conclude that an
examinee with a CAST score of 49 is likely to be in AFQT category IIIA. Thus,
there are several ways in which a given recruiter may convert CAST point
predictions into category predictions.


Figure 2 shows the pattern of these predictions at two AFQT category
cutpoints when the assumption is that CAST scores are interpreted at face
value. At each of the IIIB/IVA and IIIA/IIIB cutpoints, 81% of the examinees
are correctly classified as either above or below the cutpoint. At the lower
end of the ability continuum (Figure 2a), the misclassifications tend to be
overpredictions (14%) rather than underpredictions (5%). At the IIIA/IIIB
cutpoint, shown in Figure 2b, the opposite is true: misclassifications tend to
be underpredictions (11%) rather than overpredictions (8%). Under the
assumption that the conversion table provided for recruiters is used, the
overall hit rates remain the same (81%). The only difference is that
misclassifications at the IIIB/IVA cutpoint are more likely to be
underpredictions (12%) rather than overpredictions (7%).

Figure 2. Pattern of CAST Predictions at Two AFQT Category Cutpoints.
Figure 2a crosses AFQT percentile score (31 or above, below 31) with CAST
score. Figure 2b crosses AFQT percentile score (50 or above, below 50) with
CAST score (below 50, 50 or above): for AFQT scores of 50 or above, 11%
underprediction and 30% hit; for AFQT scores below 50, 51% hit and 8%
overprediction. Note that the percentages in each table total 100; AFQT
percentile based on corrected 1980 norms.
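
The hit, overprediction, and underprediction rates discussed above come from
a two-by-two cross-tabulation at a single cutpoint. A minimal sketch of that
tally, assuming CAST scores are interpreted at face value, is shown below.
The function name and the example scores are hypothetical.

```python
def classify_at_cutpoint(cast_scores, afqt_scores, cutpoint):
    """Percent hits, overpredictions, and underpredictions at one cutpoint."""
    counts = {"hit": 0, "overprediction": 0, "underprediction": 0}
    for cast, afqt in zip(cast_scores, afqt_scores):
        predicted_above = cast >= cutpoint
        actual_above = afqt >= cutpoint
        if predicted_above == actual_above:
            counts["hit"] += 1
        elif predicted_above:
            # CAST placed the examinee above the cutpoint; AFQT did not.
            counts["overprediction"] += 1
        else:
            # CAST placed the examinee below the cutpoint; AFQT was at or above it.
            counts["underprediction"] += 1
    n = len(cast_scores)
    return {label: 100.0 * count / n for label, count in counts.items()}


# Four hypothetical examinees evaluated at the IIIA/IIIB cutpoint (50).
rates = classify_at_cutpoint([49, 50, 60, 30], [55, 45, 70, 20], cutpoint=50)
```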


Relationship to Aptitude Area Scores


Table 3 shows the bivariate correlations between CAST scores and Army
aptitude area composite scores. Several of these correlations meet or exceed
the size of the correlation between CAST and AFQT. The relationships between
CAST and the combat, field artillery, and mechanical maintenance composites,
however, are probably too small to be useful.


Table 3

Correlation Between

Clerical
Combat
Electronic Maintenance
Field Artillery
General Maintenance
Mechanical Maintenance
Operators/Food
Surveillance and Communication
Skilled Technical
General Technical

Note. N = 5,909


Administration Time. Using data from the unrestricted sample to compute
time estimates, the mean time required to administer CAST is 16 minutes. This
estimate is several minutes higher than that reported in Knapp and Pliske
(1986). The difference is attributable to an error in the reaction time data
field that has since been corrected.

Perhaps a more meaningful way to present test administration time
information is as follows. Twenty-five percent of the examinees completed
CAST in less than 12 minutes, 50% completed CAST in less than 15 minutes, and
90% completed CAST within 24 minutes. No steps were taken to trim the time
estimates to eliminate random responders or examinees who were interrupted
during the course of the test.

Item Banks. Out of the 78 WK items available in the CAST item bank, 63
were administered 15 or more times to the 14,410 CAST examinees. Out of the
225 AR items available for use, only 54 were used 15 or more times in this
sample. Thus, the "operational" item banks are smaller than the "available"
item banks.
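
The distinction between available and operational items rests on a simple
usage count. A minimal sketch follows, assuming a flat list of the item
identification numbers recorded across all examinees; the function name and
the item identifiers in the comment are hypothetical.

```python
from collections import Counter


def operational_items(administered_ids, min_uses=15):
    """Return the set of item IDs administered at least min_uses times.

    administered_ids is one flat sequence of item IDs pooled over all
    examinees, for example ["WK01", "WK07", "WK01", ...].
    """
    counts = Counter(administered_ids)
    return {item for item, uses in counts.items() if uses >= min_uses}
```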

Each CAST item has three parameter values associated with it. The first
two parameters reflect the discriminability (a-parameter) and the difficulty
(b-parameter) of the test item. The third parameter (c-parameter) estimates
the probability that the item can be answered correctly by guessing. Urry
(1974) outlined the item bank characteristics that would permit efficient and
accurate adaptive testing. They are:

1.  Item  discrimination  values  as  high  as  possible  and  no  lower  than  .80. 

2.  Item  difficulty  values  widely  and  evenly  distributed. 

3. Item guessing parameters as low as possible, with .30 as a maximum.

4.  There  should  be  a  sufficient  number  of  items. 
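
The two numeric thresholds in the list above lend themselves to a mechanical
check. A minimal sketch is given below, assuming each item is represented as
an (a, b, c) tuple. Criteria 2 and 4 call for judgment, so the sketch only
reports the difficulty range and the item count without evaluating them. The
function name and dictionary keys are hypothetical.

```python
def check_urry_criteria(items, min_a=0.80, max_c=0.30):
    """Summarize an item bank against the checkable parts of the criteria.

    items is a list of (a, b, c) tuples: discrimination, difficulty, guessing.
    """
    a_values = [a for a, _, _ in items]
    b_values = [b for _, b, _ in items]
    c_values = [c for _, _, c in items]
    return {
        "n_items": len(items),
        "all_a_at_least_min": all(a >= min_a for a in a_values),
        "all_c_at_most_max": all(c <= max_c for c in c_values),
        "b_range": (min(b_values), max(b_values)),
    }
```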

Table 4 shows the distribution of item discrimination values for the
available and operational WK item pools. The majority of items in both item
pools have discrimination values between 1.0 and 2.0. CAST could probably
benefit from a larger number of more discriminating items; however, all items
meet the minimum criterion suggested by Urry. Since the majority of available
WK items are actually used (81%), it is not surprising that there is little
difference in the distribution of discrimination values between the two sets
of items.

The distribution of WK item difficulty levels is shown in Table 5. Both
the operational and available item pools exhibit a wide range of difficulty
values. The available item pool has more easy items (i.e., b < 0) than
difficult items. The distribution of difficulty levels is much more even in
the operational item pool, but it is still skewed toward very easy items.

Tables 6 and 7 show the distribution of discrimination and difficulty
parameters for the available and operational AR item pools. Since the
operational item pool contains only 24% of the available items, there are some
fairly striking differences between the two sets of items. Although all of
the AR items have discrimination values at or above the minimum of .8, 80% of
the discrimination values in the available pool of AR items are less than 1.5.
In contrast, only 48% of the operational AR items have discrimination values
less than 1.5. Thus, there are a large number of AR items that are not used
because their discrimination values are relatively low.
Table 4

Distribution of WK Item Discrimination Levels

Available Item Pool

                                       Cumulative    Cumulative
a             Frequency    Percent     Frequency     Percent

.8 - .9           8         10.3           8           10.3
1.0 - 1.4        45         57.7          53           67.9
1.5 - 1.9        20         25.6          73           93.6
2.0 - 2.4         3          3.8          76           97.4
2.5 - 2.7         2          2.6          78          100.0

Operational Item Pool

                                       Cumulative    Cumulative
a             Frequency    Percent     Frequency     Percent

.9                3          4.8           3            4.8
1.0 - 1.4        35         55.6          38           60.3
1.5 - 1.9        20         31.7          58           92.1
2.0 - 2.4         3          4.8          61           96.8
2.5 - 2.7         2          3.2          63          100.0


Table 5

Distribution of WK Item Difficulty

Available Item Pool

                                         Cumulative   Cumulative
      b          Frequency    Percent    Frequency     Percent

-2.0 to -1.5        17         21.8         17          21.8
-1.4 to -1.0        11         14.1         28          35.9
-0.9 to -0.5        11         14.1         39          50.0
-0.4 to  0          11         14.1         50          64.1
 0.1 to  0.5         9         11.5         59          75.6
 0.6 to  1.0         8         10.3         67          85.9
 0.9 to  1.5         5          6.4         72          92.3
 1.6 to  2.0         6          7.7         78         100.0

Operational Item Pool

                                         Cumulative   Cumulative
      b          Frequency    Percent    Frequency     Percent

-2.0 to -1.5        15         23.8         15          23.8
-1.4 to -1.0         7         11.1         22          34.9
-0.9 to -0.5         6          9.5         28          44.4
-0.4 to  0           8         12.7         36          57.1
 0.1 to  0.5         8         12.7         44          69.8
 0.6 to  1.0         8         12.7         52          82.5
 0.9 to  1.5         5          7.9         57          90.5
 1.6 to  2.0         6          9.5         63         100.0

Table 6

Distribution of AR Item Discrimination Levels

Available Item Pool

                                      Cumulative   Cumulative
    a         Frequency    Percent    Frequency     Percent

 .7 -
1   -
1.5 -
2.0

Operational Item Pool

                                      Cumulative   Cumulative
    a         Frequency    Percent    Frequency     Percent

 .7 -  .9         7         13.0          7          13.
1   - 1.4        19         35.2         26          48.
1.5 - 1.9        21         38.9         47          87.
2.0               7         13.0         54         100.

Table 7

Distribution of AR Item Difficulty Levels

Available Item Pool

                                      Cumulative   Cumulative
    b         Frequency    Percent    Frequency     Percent

Operational Item Pool

                                      Cumulative   Cumulative
    b         Frequency    Percent    Frequency     Percent


Looking at Table 7, one can see that the AR items cover a smaller range of
difficulty than is desirable. In fact, the easiest level of difficulty
(b < -1.5) is not represented at all. Only 23% of the items in the
available AR item pool have difficulty values less than zero. In the
operational item pool, this percentage rises to 44%.

With multiple-choice items, the guessing parameter values are partially
determined by the number of response alternatives. Because CAST items
generally have five response alternatives, this puts a reasonable upper limit
on the size of the c-value. Table 8 shows the distribution of c-parameters
for the available WK and AR item pools. The distributions of values for the
operational item pools are highly similar, so they are not shown. In the WK
item pool, approximately 85% of the c-parameter values are below .20 and the
majority of the values are between .05 and .15. The AR items tend to have
c-parameters that are somewhat higher than those of the WK items. Most of the
values are between .15 and .25. Approximately 58% of the AR c-values are
below .20.
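
The roles of the three item parameters can be made concrete with a short
sketch. It assumes the three-parameter logistic model, which is consistent
with the a, b, and c parameters discussed here but is not named in this
section. The scaling constant D = 1.7, the function names, and the example
parameter values are illustrative assumptions, not CAST values.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption)


def p_correct(theta, a, b, c):
    """Probability of a correct response under the assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Item information at theta under the assumed 3PL model."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


# With five response alternatives, blind guessing succeeds 1 time in 5.
chance_level = 1 / 5

# A very low-ability examinee answers correctly at about the c-value.
low_ability = p_correct(-10.0, a=1.2, b=0.0, c=0.15)

# At theta = b the probability is halfway between c and 1.
at_difficulty = p_correct(0.0, a=1.2, b=0.0, c=0.15)

# A more discriminating item is more informative near its difficulty level.
info_low_a = item_information(0.0, a=1.0, b=0.0, c=0.15)
info_high_a = item_information(0.0, a=2.0, b=0.0, c=0.15)
```

Under this model, a higher c-value lowers the information an item provides,
which is one reason guessing values well below the chance level of .20 are
preferred.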

Table 8

Distribution of WK and AR Guessing Parameter Values

WK Item Pool

                                      Cumulative   Cumulative
    c         Frequency    Percent    Frequency     Percent

      .04         2          2.6          2           2.6
.05 - .09        26         33.3         28          35.9
.10 - .14        19         24.4         47          60.3
.15 - .19        19         24.4         66          84.6
.20 - .24         9         11.5         75          96.2
.25 - .26         3          3.8         78         100.0

AR Item Pool

                                      Cumulative   Cumulative
    c         Frequency    Percent    Frequency     Percent

.03 - .04         2          0.9          2           0.9
.05 - .09         7          3.1          9           4.0
.10 - .14        23         10.2         32          14.2
.15 - .19        98         43.6        130          57.8
.20 - .24        85         37.8        215          95.6
.25 - .30        10          4.4        225         100.0

Finally, Table 9 shows the correlations between item parameters for the
different item banks. In three cases (AR available, WK available, and WK
operational), there is a moderate positive correlation between item difficulty
and item discrimination. In the AR operational item pool, however, this
relationship is quite large (r = .834). Since so few easy items are available,
CAST must use easy items with relatively low discrimination values. Yet there
are a large number of difficult items, so CAST uses only those difficult items
that are also highly discriminating. This situation has resulted in the high
observed correlation between discrimination and difficulty.
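
The selection effect described above can be illustrated with a small
constructed example. The item pool below is entirely hypothetical: it crosses
four difficulty levels with four discrimination levels so that the two
parameters are uncorrelated, then keeps every easy item but only the highly
discriminating difficult items. The threshold of 1.6 is an assumption chosen
for illustration.

```python
import math
from itertools import product


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical pool: every discrimination level paired with every difficulty.
discriminations = [0.8, 1.2, 1.6, 2.0]
difficulties = [-1.5, -0.5, 0.5, 1.5]
pool = list(product(discriminations, difficulties))  # (a, b) pairs

# Keep all easy items (b < 0); keep difficult items only if a >= 1.6.
selected = [(a, b) for a, b in pool if b < 0 or a >= 1.6]

r_pool = pearson([a for a, _ in pool], [b for _, b in pool])
r_selected = pearson([a for a, _ in selected], [b for _, b in selected])
```

In this constructed pool the correlation between a and b is zero before
selection and clearly positive afterward, even though no individual item has
changed.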


Table 9

Correlation Between Item Parameters (a)

           WK Item Pools                 AR Item Pools

        a       b       c             a       b       c

a       --     .299    .214           --     .276    .083
b      .236     --     .170          .834     --    -.037
c      .224    .254     --           .118   -.107     --

(a) Upper diagonal values are from the available item pools; lower diagonal
values are from the operational item pools.


In summary, both the AR and WK item pools meet the minimum standards
outlined by Urry (1974). Although the size of the WK item pool is relatively
small, the item characteristics are quite acceptable. The item pool could be
improved by adding new items that meet or exceed the standards of the old, and
that are focused on relatively high difficulty levels. Despite its size, the
AR item pool has less desirable characteristics. The most serious concern is
the lack of easy items. The discrimination levels of the items also tend to
be lower than desired, and the guessing values are a bit high.

Alternative  Subtest  Lengths 


As mentioned earlier, the CAST data collection software recorded theta
estimates after each test item was administered. Using this information, we
can compute multiple correlation estimates for all possible combinations of
subtest lengths up to 15 WK and 10 AR items. Table 10 shows these estimates
for combinations of five or more items. As one would expect, larger numbers
of test items result in higher validity estimates. One must add several
items, however, to produce a noticeable increase in validity.

Given that test administration time is also an important consideration in
determining test length, Table 11 presents the mean administration times for
the subtest length combinations shown in Table 10. The addition of AR items
adds appreciably more time to the test than does the addition of WK items.

Comparison of Tables 10 and 11 shows that a change in the subtest length
of CAST is not strongly supported. It would take an average of 25 minutes to
administer 10 AR and 15 WK items (the longest subtest length that can be
considered from these data). The validity estimate would increase from the
current .79 to .83. The standard error of estimate would decrease from 14
points to 13 points. Thus, even the maximum subtest length evaluated here does
not yield a particularly substantial increase in validity.
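
The small change in the standard error of estimate follows from the usual
regression relationship SEE = SD * sqrt(1 - R^2). The sketch below
back-calculates the criterion standard deviation implied by the current
values (.79 and 14 points) and applies it to the longer test. The implied
standard deviation is an inference made here for illustration, not a figure
reported in this section, and the formula ignores any degrees-of-freedom
correction.

```python
import math


def standard_error_of_estimate(criterion_sd, validity):
    """Standard error of estimate for a given criterion SD and validity."""
    return criterion_sd * math.sqrt(1.0 - validity ** 2)


# Current operational values reported in the text.
current_validity = 0.79
current_see = 14.0

# Criterion SD implied by those two values (a back-calculation).
implied_sd = current_see / math.sqrt(1.0 - current_validity ** 2)

# Expected standard error if validity rose to .83 with the longest subtests.
longer_test_see = standard_error_of_estimate(implied_sd, 0.83)

print(round(implied_sd, 1), round(longer_test_see, 1))
```

Because the standard error depends on the square root of 1 - R^2, a gain of
.04 in validity at this level reduces the error by only about one point,
which agrees with the change from 14 to 13 points noted above.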

SUMMARY 

In January 1985, 60 Army recruiting stations were asked to begin forwarding
CAST data to ARI. This data collection effort continued through the end of
December 1985. This paper has reported the details of the data collection
procedures and the statistical analyses of the data. Whereas earlier reports
related to this data collection effort focused on data gathered during the first
six months of 1985, the analyses reported herein are based on the entire CAST
dataset.

The data collection effort in 1985 resulted in a large amount of useful
information. It has provided solid evidence of CAST's ability to predict AFQT
performance. This evidence was needed to provide a clear justification for
using CAST as a tool for screening potential Army applicants. Despite the
relatively strong relationship between CAST and AFQT scores, however, this
screening test could be refined to better suit the Army's needs. In fact,
refinement efforts are currently underway, and these efforts rely heavily on
information from the 1985 data base.

Possible immediate changes to CAST include the way in which the results are
displayed, the algorithm used to compute AFQT percentile scores, and the length
of the two subtests. Each of these areas of potential change is briefly
reviewed below.

ARI has suggested several alternative ways to present CAST results to
recruiters (Knapp, 1987). Basically, two approaches were considered. One
approach is based on the prediction intervals associated with the CAST estimates
of AFQT percentile scores. The second approach is based on the estimated
probability that an individual with a given CAST score will fall into one of
three or four AFQT performance categories. The information needed to program
these alternative output displays into the CAST software could only be derived
from a data base such as that created in 1985.
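
The two display approaches can be sketched as follows. This is a minimal
illustration rather than the method described by Knapp (1987): it assumes
normally distributed prediction errors, uses the 14-point standard error of
estimate reported earlier, clamps intervals to a 1-99 percentile range, and
uses hypothetical category cut scores of 31 and 50. All function names are
likewise hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def prediction_interval(predicted, see, z=1.96):
    """Approximate 95% interval, clamped to the 1-99 percentile range."""
    low = max(1.0, predicted - z * see)
    high = min(99.0, predicted + z * see)
    return low, high


def category_probabilities(predicted, see, cuts):
    """Probability of falling in each category defined by the cut scores."""
    bounds = [-math.inf] + sorted(cuts) + [math.inf]
    return [normal_cdf(hi, predicted, see) - normal_cdf(lo, predicted, see)
            for lo, hi in zip(bounds[:-1], bounds[1:])]


see = 14.0               # standard error of estimate reported in the text
cuts = [31.0, 50.0]      # hypothetical category boundaries

interval = prediction_interval(50.0, see)
probs = category_probabilities(50.0, see, cuts)
```

For a predicted score equal to the upper cut, half of the probability falls
in the top category, which shows why a single point estimate can overstate
how certain a recruiter should be about a prospect near a boundary.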

These data also allow the computation of a stable and precise algorithm for
deriving predicted AFQT percentile scores from the CAST subtest scores. Despite
potential changes in the way in which AFQT scores are computed (i.e., replacing
the Numerical Operations subtest with Math Knowledge) and previous changes in
the derivation of AFQT percentile scores, this data base can provide the
appropriate prediction algorithm as needed. A software change to correct the
intercept of the current algorithm and the display of CAST results is pending.

Finally, analyses reported herein related to the changes in predictive
ability and test administration time as a function of subtest length have been
used to reevaluate the current subtest length of CAST. The addition or deletion
of several WK items has little effect on either the validity estimate or the
testing time. Presently, so few AR items are administered that it would be very
risky to consider reducing their number. Adding AR items significantly
increases testing time (1-2 minutes per item) with little payoff in terms of
increased predictive accuracy. Thus, these data seem to justify the current CAST
subtest length. Note, however, that future changes to CAST that affect the
internal testing strategy may influence the subtest length issue.

As a result of a major refinement effort that began in 1987, a new version
of CAST, to be known as CAST II, will be developed. The focus of this
refinement project will be to reconsider the CAST subtest scoring and item
selection algorithms and to improve the item banks. As part of this refinement
project, another large scale data collection will be required. Item calibration
data will be collected from new recruits at Army Reception Battalions and from
CAST examinees at recruiting stations. CAST II will be available for operational
use in 1988.

The 1985 CAST data base continues to provide information that directs this
major refinement effort. The most obvious example is related to the item
selection rule and improvement of the item pools. ARI's existing data base
confirms that some items are over-used and clearly shows the pattern of item
usage. This information will help to determine a more appropriate item
selection algorithm and to decide if some test items should be deleted from the
item banks.
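
Item usage of this kind can be summarized as the proportion of test sessions
in which each item was administered. The sketch below is hypothetical
throughout: the item identifiers, the sessions, and the 50% threshold for
flagging an item as over-used are invented for illustration and do not come
from the CAST data base.

```python
from collections import Counter


def exposure_rates(sessions):
    """Proportion of sessions in which each item was administered."""
    counts = Counter(item for session in sessions for item in set(session))
    n_sessions = len(sessions)
    return {item: count / n_sessions for item, count in counts.items()}


def over_used(rates, max_rate):
    """Items administered in more than max_rate of all sessions."""
    return sorted(item for item, rate in rates.items() if rate > max_rate)


# Hypothetical records of the items administered in four test sessions.
sessions = [
    ["WK01", "WK07", "WK12"],
    ["WK01", "WK03", "WK12"],
    ["WK01", "WK07", "WK09"],
    ["WK01", "WK05", "WK12"],
]

rates = exposure_rates(sessions)
flagged = over_used(rates, max_rate=0.5)
print(flagged)
```

An item that appears in nearly every session, such as the first item here,
is a natural candidate for review when the item selection rule is revised.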

Assuming that the developmental work for CAST II will result in an enduring
internal testing framework, the remaining problem will be to ensure that the
item banks and AFQT prediction algorithm are periodically monitored and updated.
A special version of the CAST II software will have the capability of collecting
data that can be used to accomplish this maintenance function in a relatively
unobtrusive manner.

Although  adaptive  testing  is  very  efficient  when  compared  to  traditional 
testing,  it  can  be  quite  costly  in  research  and  development  resources.  The 
primary  problem  is  that  each  potential  test  item  needs  to  be  administered  to 
close  to  2,000  people  to  provide  an  adequate  assessment  of  its  psychometric 
properties.  Rather  than  collecting  such  data  all  at  once,  it  is  possible  to 
collect  the  data  a  little  at  a  time.  That  is,  one  can  embed  several  non-scored 
test  items  into  the  operational  version  of  the  test  and  record  the  item  response 
information  for  future  research  use. 
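
The embedding idea can be sketched in a few lines. This is a simplified
illustration under stated assumptions, not the CAST II design: it treats the
operational items as a fixed list (an adaptive test would select them as the
examinee responds), and the function names and item identifiers are
hypothetical.

```python
import random


def build_session(operational_ids, experimental_ids, n_embedded, rng):
    """Return (item_id, is_scored) pairs with unscored items mixed in."""
    session = [(item_id, True) for item_id in operational_ids]
    for item_id in rng.sample(experimental_ids, n_embedded):
        position = rng.randint(0, len(session))  # any slot, including the end
        session.insert(position, (item_id, False))
    return session


def split_responses(session, responses):
    """Separate scored responses from those kept only for calibration."""
    scored = [r for (_, is_scored), r in zip(session, responses) if is_scored]
    research = [(item_id, r)
                for (item_id, is_scored), r in zip(session, responses)
                if not is_scored]
    return scored, research


rng = random.Random(0)  # seeded only to make the example repeatable
session = build_session(["WK1", "WK2", "WK3"], ["X1", "X2", "X3", "X4"], 2, rng)

# Hypothetical responses (1 = correct, 0 = incorrect), one per item shown.
responses = [1, 0, 1, 1, 0]
scored, research = split_responses(session, responses)
```

Only the scored responses would feed the ability estimate; the remaining
responses accumulate across examinees until enough are available to calibrate
each experimental item.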

Thus,  the  long-term  maintenance  program  calls  for  the  periodic  addition  of 
experimental  items  to  the  operational  CAST  II  software.  These  items  will  be 
administered  in  a  manner  that  will  be  transparent  to  both  the  examinees  and  the 
recruiters.  Data  from  these  items,  CAST  performance  scores,  and  examinee  SSN 
will  be  electronically  transmitted  from  recruiting  stations  to  a  central  data 
base.  As  time  passes,  sufficient  data  will  become  available  to  calibrate  the 
experimental  test  items.  Periodic  statistical  analysis  of  these  data  and 
examination  of  operational  item  usage  information  will  allow  regular  updating  of 
the  item  banks.  Also  at  regular  intervals,  CAST  performance  scores  will  be 